DYNet: A Printed Book Detection Model Using Dual Kernel Neural Networks

Target detection has always been a hotspot in image processing/computer vision research, and small-target detection is a frequently encountered problem in the field of target detection. With the continuous innovation of target detection technology, people always hope that the detection of small targets can reach the real-time accuracy of large-target detection. In this paper, a small-target detection model based on dual-core convolutional neural networks (CNN) is proposed, which is mainly used for the intelligent detection of books in the production line of printed books. The model is mainly composed of two modules, including a region prediction module and suspicious target search module. The region prediction module uses a CNN to predict suspicious region blocks in a large context. The suspicious target search module uses a different CNN from the above to find tiny targets in the predicted region blocks. Comparative testing of four small book target samples using this model shows that this model has better book small-target detection accuracy compared to other models.


Introduction
Books are crucial to human society as an important medium for cultural inheritance and knowledge accumulation. However, due to the differences in printing equipment and processes of different printing manufacturers, the quality of book printing is often highly variable, affecting the quality of books and reading experience of readers. Therefore, intelligent book printing quality detection is an important research direction. This study was aimed at the intelligent detection and evaluation of book quality through technical means, such as computer vision and image processing.
The post-printing processes of book production include folding, assembling, and binding. The quality control of each process is particularly important. In the detection of book ladder labels, the traditional single template matching algorithm cannot meet the requirement for accurate detection because the book ladder label has a small area in the whole picture and the size is inconsistent [1]. In 1992, Gengfeng et al. [2] proposed the gray-code color code detection method based on a single-board microcomputer. In 2009, Fei [3] developed a book sticker detection system based on machine vision. In 2014, Xiaoqing [4] developed a total and collator detection system for the children's hardcover book production line. The above algorithms and systems all used a single matching algorithm. This type of algorithm can detect whether more or less book stickers were attached, but there were still serious issues, such as missed or false detections.
The above issues can be addressed to some extent through edge detection and image registration techniques. In 2016, Mei and Xiangmin [5] developed an online inspection system for ladder labels during book binding. The system binarized the image and used the Roberts operator for edge detection, followed by similarity matching for registration. In 2018, Guo and Xinwen [6] proposed an adaptive ladder label detection method.
In contrast, in recent years, in order to achieve the accurate and rapid detection of objects in images, many scholars have proposed a variety of deep learning methods, which play a key role in the field of target detection. An in-depth study of these methods helps to better characterize the method proposed in this paper.
First, the SSD (Single Shot MultiBox Detector [11]) is a popular target detection method that achieves high detection performance by using feature maps at different scales and using default boxes for each feature map location for regression and classification.
Second, the YOLO (You Only Look Once) family is another important group of target detection algorithms. The original YOLOv3 [12] introduced three scales of detection and used multi-scale prediction and three different sizes of anchors to improve the model's detection of small objects. YOLOv4 [13] further improved the detection performance by proposing the use of clustered anchor points, an improved feature extractor, and a PANet that uses an attention mechanism, which makes its performance and efficiency stand out among many target detection models.
YOLOv5s, a new member of the YOLO family, optimizes the size and speed of the model so that it can run on devices with more limited resources while maintaining higher accuracy.
YOLOv7 [14] is a lightweight real-time target detection algorithm proposed by the authors of YOLOv4. YOLOv7 is not only able to achieve better performance on large-scale, multi-class datasets, but is also able to train and detect effectively on small-sample, small-class datasets.
In conclusion, the traditional template matching-based method has problems such as missed and false detections in ladder label detection. Although the edge detection and image registration techniques present improvements in terms of detection accuracy, they have high requirements for image quality. The traditional YOLO-based detection network has improved detection accuracy, but it still has difficulty identifying certain ladder labels.
To address the difficulties of book ladder label detection, this study designed a high-precision detection model called double YOLO net (DYNet) for small targets. It combined the extraction of object regions with the classification and recognition of regions into one model. A comparative experiment revealed that compared with other detection models, DYNet had higher detection accuracy and could realize the real-time, fast, and accurate detection of book ladder labels. Thus, it could be applied to improve the efficiency of the actual book production line.

Principles and Methods
This study considered candidate region-based object detection algorithms and found that there were two stages: candidate region generation and object classification. In the candidate region generation stage, the candidate regions cover different parts of the input image using rectangular boxes of different sizes and shapes [15]. In the target classification stage, the algorithm classifies these candidate regions and gives the location and category of the target. Generally, selective search algorithms [16] or some improved algorithms (such as SQR [16], RetinaRPN [17], and MCG [18]) are used to generate candidate regions that may contain targets. However, selective search algorithms require a significant amount of time and computational resources and generate a large number of candidate regions, and most candidate regions are redundant, which can affect the accuracy of target detection.
Therefore, in this paper, a CNN-based regional prediction module (RPM) is proposed for candidate region generation. Due to the intervention of human-supervised learning and the characteristics of CNNs, only one calculation is needed to ensure that there are always targets in the region, which can greatly reduce the occurrence of "false positives".
Afterwards, the predicted area output from RPM is input into the suspicious target search module (STSM). The image is processed through the noise reduction algorithm module (NRAM) before input. In the original image, the target to be detected is small. However, within the predicted area output by RPM, it is a large or medium-sized target. By using a detection framework that performs well in detecting large objects, it is easier to obtain the desired results. As a result, certain target detection frameworks are modified to make them more suitable for detecting the desired targets.
The DYNet proposed in this paper mainly consists of two parts: RPM and STSM. Between the RPM and STSM, the NRAM is added to reduce image noise, and a quality evaluator (QE) is designed to evaluate the detection performance of this network. The input and output results of each module are shown in Figure 2.
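The data flow between the modules can be sketched as follows. This is a minimal illustration, assuming that RPM returns a single region (x, y, w, h) per image and that each module is available as a callable; the names `crop` and `dynet_pipeline` and the callable signatures are hypothetical and are not taken from the original implementation.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h): center and size, in pixels
Image = List[List[float]]                # grayscale image as rows of pixel values


def crop(image: Image, box: Box) -> Image:
    """Cut the region with center (x, y) and size (w, h) out of the image."""
    x, y, w, h = box
    top = max(0, int(round(y - h / 2)))
    left = max(0, int(round(x - w / 2)))
    bottom = min(len(image), int(round(y + h / 2)))
    right = min(len(image[0]), int(round(x + w / 2)))
    return [row[left:right] for row in image[top:bottom]]


def dynet_pipeline(
    image: Image,
    rpm: Callable[[Image], Box],           # hypothetical: region prediction
    nram: Callable[[Image], Image],        # hypothetical: noise reduction
    stsm: Callable[[Image], List[Box]],    # hypothetical: small-target search
) -> List[Box]:
    """Run RPM -> crop -> NRAM -> STSM; boxes are in region coordinates."""
    region_box = rpm(image)            # stage 1: coarse region in the full image
    region = crop(image, region_box)   # the small target becomes relatively large
    denoised = nram(region)            # noise reduction between the two stages
    return stsm(denoised)              # stage 2: targets inside the region
```

Because the second detector only sees the cropped region, a target that is small in the full frame occupies a much larger fraction of its input.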
In terms of dataset acquisition, this study used a Dahua industrial camera with a photoelectric sensor to take pictures of books on the production lines of several printing factories. The camera automatically took a picture when a book passed the photoelectric sensor, and a total of about 11,000 images were acquired. After eliminating blurred images and images with shifted shooting angles caused by camera vibration, about 10,000 usable images were obtained, of which 7000 were used for the training set and about 2000 for the validation set. The final effect was tested on a factory assembly line with a total of 3342 books. The source of the test set and the test results will be presented in Section 5, the Experimental Results section.

RPM
Due to the high requirements for detection accuracy and speed, a target detection network with CSPDarknet-53 [19] as the core was chosen. As shown in Figure 3, CSPDarknet-53 is based on Darknet-53 and adds a CSP convolution structure to the original neural network, which improves the detection accuracy and speed.

This study simplified part of the backbone of the network based on industrial production requirements. Due to the large search area, the dataset used can be considered as a large-scale object detection dataset. Also, the search area was roughly the same; thus, even reducing the number of convolutional layers to fewer than 20 could achieve the expected effect.

Network Structure
RPM is a lite version of CSPDarknet-53, with an overall structure of 19 convolutional layers, which greatly improves the detection speed. Figure 4 shows the detailed structure of the RPM network. The prediction branch was removed from the original network, and the single residual structure was changed to a double residual structure.
The final output of RPM is a quintuple ($\mathrm{loss}_{RPM}$, $x$, $y$, $w$, $h$), where $(x, y)$ are the center coordinates of the detected object and $(w, h)$ are its width and height. $\mathrm{loss}_{RPM}$ is the loss of the final output of the RPM. It consists of two parts, as shown in the following equation, where $\lambda_1$ and $\lambda_2$ are the balancing coefficients.

$$\mathrm{loss}_{RPM} = \lambda_1 \Pr(\mathrm{obj}) + \lambda_2 \Pr(\mathrm{loc})$$

The first part is the confidence loss of the object, $\Pr(\mathrm{obj})$, which reflects the probability that the detected object is a real object, as shown in the following equation, where $p_c$ is the probability that the detected object is a real object.

$$\Pr(\mathrm{obj}) = -\ln(p_c)$$

The second part is the coordinate loss, $\Pr(\mathrm{loc})$, which represents the difference between the coordinates of the detected object and the center of the real object, where $b$ denotes the real coordinates and $\hat{b}$ denotes the predicted coordinates.
CNNs can learn more fine-grained features that help to localize the target in shallow convolutional structures and more semantic features that help to classify the target in deep convolutional structures [20]. Therefore, it was possible to use the reduction of residual structures and stacking of residual layers in the RPM to improve the detection speed.
Three double residual structures were used in the RPM, as shown in Figure 5. This structure allows output features at different depths in the network to learn from each other and perform feature fusion between features with different depth resolutions. This not only improves detection accuracy compared to simply stacking convolutional kernels but also solves the problem of slow detection speed caused by multiple residual structures [21].
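The two-part RPM loss described above can be illustrated with a short sketch. The confidence term follows the stated form, the negative logarithm of $p_c$. The exact form of the coordinate term is not given in the text, so the squared Euclidean distance used here is an assumption, as are the default values of the balancing coefficients.

```python
import math


def confidence_loss(p_c: float) -> float:
    """Pr(obj) = -ln(p_c), with p_c the predicted object probability in (0, 1]."""
    if not 0.0 < p_c <= 1.0:
        raise ValueError("p_c must lie in (0, 1]")
    return -math.log(p_c)


def coordinate_loss(b_true, b_pred) -> float:
    """ASSUMPTION: squared Euclidean distance between real and predicted centers."""
    return sum((t - p) ** 2 for t, p in zip(b_true, b_pred))


def rpm_loss(p_c, b_true, b_pred, lambda_1=1.0, lambda_2=1.0) -> float:
    """loss_RPM = lambda_1 * Pr(obj) + lambda_2 * Pr(loc).

    lambda_1 and lambda_2 default to 1.0 here only for illustration.
    """
    return lambda_1 * confidence_loss(p_c) + lambda_2 * coordinate_loss(b_true, b_pred)
```

A confident, well-centered prediction (p_c near 1, predicted center near the real center) drives both terms toward zero.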

STSM
Unlike RPM, the main task of STSM is to detect small targets, so the network structure should be deepened, and some new features should be added to obtain better detection performance.

Network Structure
To achieve better detection performance, the network structure of CSPDarknet-53 was modified to be more suitable for detecting ladder labels at the spine of printed books, as shown in Figure 6. On the basis of the original network, a prediction branch was added, and the Mish activation function was chosen. The modified network was named STSM.
The final output of STSM is consistent with that of RPM and consists of a quintuple ($\mathrm{loss}_{STSM}$, $x$, $y$, $w$, $h$), but the loss function of STSM, $\mathrm{loss}_{STSM}$, consists of three parts. When the size of the input image is $S \times S \times B$, $\mathrm{loss}_{STSM}$ is computed as shown in the following equation: the STSM iterates over all prediction frames and sums the three parts of the loss with specific weights, where $\lambda_3$, $\lambda_4$, and $\lambda_5$ are the balancing coefficients, whose most suitable values are obtained when the STSM is trained.

$$\mathrm{loss}_{STSM} = \sum_{i=1}^{S \times S \times B} \left[ \lambda_3 \Pr(\mathrm{obj}) + \lambda_4 \Pr(\mathrm{loc}) + \lambda_5 \Pr(\mathrm{IoU}) \right]$$

Two of the parts in $\mathrm{loss}_{STSM}$ are the same as in RPM: the confidence loss $\Pr(\mathrm{obj})$ and the coordinate loss $\Pr(\mathrm{loc})$ of the object.
The third part is the IoU [22] loss $\Pr(\mathrm{IoU})$, where IoU denotes the Intersection over Union of two targets recognized by STSM, as illustrated in Figure 7.
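As an illustration of the quantities involved, the sketch below computes the Intersection over Union of two boxes in the (x, y, w, h) center format used for the network outputs, and sums the three weighted loss terms over a set of predictions. The function names and the default coefficient values are assumptions; the per-prediction loss terms are taken as precomputed inputs because their exact expressions are not reproduced here.

```python
def iou(box_a, box_b) -> float:
    """IoU of two boxes given as (x, y, w, h), with (x, y) the box center."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Overlap extent along each axis, clamped at zero for disjoint boxes
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def stsm_loss(predictions, lambda_3=1.0, lambda_4=1.0, lambda_5=1.0) -> float:
    """Sum lambda_3*Pr(obj) + lambda_4*Pr(loc) + lambda_5*Pr(IoU) over predictions.

    Each prediction is a tuple (pr_obj, pr_loc, pr_iou) of precomputed loss
    terms; the default coefficients are placeholders, not trained values.
    """
    return sum(
        lambda_3 * pr_obj + lambda_4 * pr_loc + lambda_5 * pr_iou
        for pr_obj, pr_loc, pr_iou in predictions
    )
```

Two boxes of size 2 × 2 whose centers are one unit apart along the x axis overlap in an area of 2 out of a union of 6, giving an IoU of 1/3.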

Mish Activation Function
Unlike RPM, which uses the LeakyReLU function, STSM uses Mish [23] as the activation function, with the function expression shown in the following equation.

$$f(x) = x \tanh\left(\ln\left(1 + e^{x}\right)\right)$$

where $x$ is the input, and $f(x)$ is the output. In the experiment, it was found that compared to the LeakyReLU function, the Mish function had different smoothing properties, which gave it better generalization capability. This generalization capability can enable the function to effectively optimize the results of ladder label detection and improve the accuracy of detection.
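For reference, the Mish expression above can be evaluated directly. The sketch below implements it next to LeakyReLU for comparison; the negative slope of 0.1 is an assumed value and is not stated in the text.

```python
import math


def softplus(x: float) -> float:
    """ln(1 + e^x), written to avoid overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: f(x) = x * tanh(ln(1 + e^x))."""
    return x * math.tanh(softplus(x))


def leaky_relu(x: float, negative_slope: float = 0.1) -> float:
    """LeakyReLU; the slope 0.1 is an assumption made for this comparison."""
    return x if x >= 0 else negative_slope * x
```

Mish is smooth at the origin, whereas LeakyReLU has a kink there, and Mish approaches the identity for large positive inputs while decaying to zero for large negative inputs.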

Multi-Target Detection Branch
In the dataset used in this study, the width of a few ladder labels may in some cases be only one-third the width of normal ladder labels. In the target detection framework, the output feature maps are described with the following notation: $f_i^l$ denotes the feature map of the $i$th detection branch in the $l$th layer; $W_i^l$ and $b_i^l$ denote the weight and bias of the $i$th detection branch in the $l$th layer, respectively; $\sigma$ denotes the activation function; and $L$ denotes the total number of layers of the network.
The more detection branches the final output has, the more feature information can be extracted in the deep layers of the network, thereby improving the detection performance for small targets but also increasing the time complexity of network computation.
The experimental comparison showed that using three detection branches led to insufficient detection accuracy. However, when there were too many detection branches, "false positives" could occur. Therefore, to detect these smaller ladder labels and balance the detection time and accuracy, one detection branch was added after the original three detection branches. Thus, STSM had four detection branches and was equipped with better detection performance for these small targets.

NRAM
Due to the low brightness and high noise of certain samples in the self-built dataset, the edges of the target to be detected were difficult to distinguish from the background. Therefore, it was necessary to use an image filtering algorithm to visibly separate the edges of the ladder labels, making it easier for the detection algorithm to recognize them. When dealing with image noise, the edge feature information of the object should be preserved, but the pixel values near the edges need to be processed. Bilateral filtering is a Gaussian filtering function based on spatial distribution. Near the edges, the pixels that are farther away have less effect on the pixel values of the edges. Thus, the edges were more prominent after processing the image using the bilateral filtering algorithm. In its mathematical expression, $g(x, y)$ denotes the pixel value of the filtered image at location $x$; $I(y)$ denotes the pixel value of the original image at location $y$; $\omega$ is a fixed-size neighborhood centered at $x$; $W_p$ is the normalization coefficient, i.e., the value of the filter summation; and $f$ and $g$ are weight functions based on the pixel-value and neighborhood distances, respectively. The weight function $f$ was used to calculate the gray scale difference between individual pixels in the neighborhood, while $g$ was used to calculate the distance between the neighboring and current pixel points.
Although bilateral filtering can better handle pixels at the edges, it cannot perform overall noise reduction of the image effectively. Therefore, an image filtering module of bilateral filtering combined with mean filtering was used in the process of bridging RPM and STSM to reduce image noise and preserve edge features. In the mathematical expression of the mean filter, $g(x, y)$ denotes the filtered image, $f(u, v)$ denotes the input image, and $K$ is the filter size.
Sensors 2023, 23, 9880
During the experiments, the noise reduction module was outstandingly effective in reducing the noise of the dataset samples.
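The combination of the two filters can be sketched in plain Python on a small grayscale image. Several choices here are assumptions and not taken from the text: both weight functions are Gaussians, the parameters `radius`, `sigma_range`, and `sigma_space` have illustrative values, border pixels use the in-image part of the window, and the mean filter is applied after the bilateral filter. The function names are hypothetical.

```python
import math


def bilateral_filter(img, radius=1, sigma_range=25.0, sigma_space=1.0):
    """Edge-preserving smoothing of a 2D grayscale image (list of rows)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            total, norm = 0.0, 0.0  # norm plays the role of W_p
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if not (0 <= rr < h and 0 <= cc < w):
                        continue  # ignore neighbors outside the image
                    diff = img[rr][cc] - img[r][c]
                    # f: weight from the gray-level difference (assumed Gaussian)
                    f = math.exp(-(diff * diff) / (2 * sigma_range ** 2))
                    # g: weight from the spatial distance (assumed Gaussian)
                    g = math.exp(-(dr * dr + dc * dc) / (2 * sigma_space ** 2))
                    total += img[rr][cc] * f * g
                    norm += f * g
            out[r][c] = total / norm
    return out


def mean_filter(img, k=3):
    """K x K mean filter; border pixels average the in-image part of the window."""
    h, w = len(img), len(img[0])
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            vals = [
                img[rr][cc]
                for rr in range(max(0, r - half), min(h, r + half + 1))
                for cc in range(max(0, c - half), min(w, c + half + 1))
            ]
            out[r][c] = sum(vals) / len(vals)
    return out


def nram(img):
    """ASSUMED order: bilateral filtering first, then mean filtering."""
    return mean_filter(bilateral_filter(img))
```

On a sharp step between 0 and 255, the bilateral filter leaves both sides almost unchanged because pixels across the edge receive a negligible gray-level weight, while the mean filter alone would blur the step.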

Quality Evaluator
In this study, among various types of detectors, it was found that, due to the existence of some ladder labels that were difficult to detect, it was difficult to accurately assess the performance of detectors on special samples using a single evaluation index. Thus, to verify the robustness of the proposed model, a quality evaluator was designed to evaluate the recognition level of various detectors on the experimental samples. In QE, there was no significant error in the detection accuracy of various detectors for general samples, but there was a significant difference in the detection accuracy for special samples. The process of QE in DYNet is shown in Figure 8. QE consists of the following modules: quantity checker, integrated checker, and weight voter. The integrated checker consists of three modules: fitting checker, IoU checker, and loss checker.

Quantity Checker
The quantity checker compared the quantity of targets detected by STSM with the prestored quantity to determine the degree to which the targets were correctly recognized. This module was mainly used for targets with larger intervals after being cut by the binding line, and the calculation process is as follows:

$$QC = \begin{cases} 1, & num = conf \\ 0, & num \neq conf \end{cases}$$

where $QC$ is the output of the quantity checker, $num$ is the number of targets detected by STSM, and $conf$ is the number in the configuration file.

Fitting Checker
The fitting checker was used to compare the target coordinates with prestored parameters after linear fitting. Due to the linear characteristics of the target itself, the module could easily evaluate the translocated target, and the evaluation indicator used for the fitting checker was the correlation coefficient, defined as follows:

$$FC = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}[x]\,\mathrm{Var}[y]}}$$

where $FC$ is the output result of the fitting checker, $\mathrm{Cov}(x, y)$ is the covariance between $x$ and $y$, $\mathrm{Var}[x]$ is the variance of $x$, and $\mathrm{Var}[y]$ is the variance of $y$. The larger $|FC|$ is, the better the effect of the fitting checker.
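The quantity checker and the fitting checker can be sketched as two small functions. The inputs are assumed to be the detected count, the configured count, and the center coordinates of the detected ladder labels; the function names, and the choice to return 0.0 when the correlation is undefined, are assumptions of this sketch.

```python
import math


def quantity_checker(num: int, conf: int) -> int:
    """QC = 1 when the detected count equals the configured count, else 0."""
    return 1 if num == conf else 0


def fitting_checker(xs, ys) -> float:
    """Correlation coefficient FC = Cov(x, y) / sqrt(Var[x] * Var[y])."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        # Correlation is undefined when all x or all y coincide;
        # returning 0.0 is a choice made for this sketch only.
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

Label centers that lie on a straight, slanted line give |FC| close to 1, while a displaced label pulls |FC| toward 0.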

IoU Checker
The IoU checker used the CIoU [24] calculation and took a threshold upper limit of 0.35 to evaluate targets that were less spaced after being cut by the binding line. In its defining equation, $IC$ is the IoU checker output, which is calculated using the IoU loss $\Pr(\mathrm{IoU})$ output from the STSM; $c$ is the minimum diagonal distance of the enclosing region that can contain both the predicted target and the real target; $\rho(b_x, b_y, \hat{b}_x, \hat{b}_y)$ is the Euclidean distance between the centroids of the predicted target and the real target; $\alpha$ is the weighting parameter; and $v$ is used to measure the similarity of the aspect ratios.

Loss Checker
The loss checker output is a weighted average of the RPM and STSM outputs, where $LC$ is the loss checker output, $Loss_{RPM}$ is the value of the loss function of RPM (which uses the LeakyReLU activation), and $Loss_{STSM}$ is the value of the loss function of STSM (which uses the Mish activation).
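The quantities used by the IoU checker and the loss checker can be illustrated as follows. The `ciou` function follows the commonly published CIoU definition (IoU minus a normalized center-distance term minus an aspect-ratio term), which may differ in detail from the exact expression used for $IC$. The weight in `loss_checker` is a hypothetical value, since the weights of the average are not stated. Boxes are (x, y, w, h) with positive width and height.

```python
import math


def ciou(pred, real) -> float:
    """CIoU of a predicted and a real box, each given as (x, y, w, h)."""
    px, py, pw, ph = pred
    rx, ry, rw, rh = real
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    r_x1, r_y1, r_x2, r_y2 = rx - rw / 2, ry - rh / 2, rx + rw / 2, ry + rh / 2
    inter_w = max(0.0, min(p_x2, r_x2) - max(p_x1, r_x1))
    inter_h = max(0.0, min(p_y2, r_y2) - max(p_y1, r_y1))
    inter = inter_w * inter_h
    iou = inter / (pw * ph + rw * rh - inter)
    # rho^2: squared Euclidean distance between the two box centers
    rho2 = (px - rx) ** 2 + (py - ry) ** 2
    # c^2: squared diagonal of the smallest box enclosing both boxes
    c2 = (max(p_x2, r_x2) - min(p_x1, r_x1)) ** 2 + (max(p_y2, r_y2) - min(p_y1, r_y1)) ** 2
    # v: aspect-ratio consistency; alpha: its weighting parameter
    v = (4 / math.pi ** 2) * (math.atan(rw / rh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    return iou - rho2 / c2 - alpha * v


def loss_checker(loss_rpm: float, loss_stsm: float, weight_rpm: float = 0.5) -> float:
    """LC as a weighted average; weight_rpm = 0.5 is a HYPOTHETICAL value."""
    return weight_rpm * loss_rpm + (1.0 - weight_rpm) * loss_stsm
```

Identical boxes give a CIoU of 1, and the value decreases, and can become negative, as the centers move apart or the aspect ratios diverge.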

Weighted Voting Rights
The weighted voter accumulated the outputs of the above modules with different weights and voted. The voting results were included in the final evaluation indicators. Here, $WV$ denotes the final output result of the weighted voter. A $WV$ close to 1 indicates that DYNet performed better in detecting samples, while a value close to 0 indicates that the quantity of ladder labels did not match the actual quantity.

Datasets and Assessment Indicators
The four small-target sample comparison tests in this study used a total of 3342 images with a range of 10 to 30 small targets.These small-target sample datasets were collected on the production line of a printing factory, including the ladder labels for 11 types of books.The scene covered all the complex on-site backgrounds at the printing plant, including lighting intensity and on-site noise level, as well as some human random errors, such as deliberately rotating the book at random angles, which were crucial for our study.The dataset was obtained from reliable and representative sources and was properly cleaned and processed to ensure its quality and accuracy.
Examples of the four samples are shown in Figure 9, with the red boxes showing the locations of the ladder labels. Table 1 describes the details of these small-target samples, and Table 2 describes their sources.
It is common to use the average precision (AP) and average recall (AR) to evaluate how well a model matches the dataset:
$$\mathrm{AP} = \frac{TP}{TP + FP}, \qquad \mathrm{AR} = \frac{TP}{TP + FN},$$
where TP is the number of correctly identified targets, FN is the number of targets that are not recognized, and FP is the number of objects that are not targets but are recognized as targets. The precision reflects how often the target identified by the model is a real target. The higher the precision, the better the detection performance of DYNet for the samples. The recall rate is the level at which the real targets are recognized by the model. A predicted object was counted as the real target when the IoU value between the predicted box and the real box was greater than 0.35. The higher the recall rate, the better DYNet can identify the real samples and the fewer samples are missed.
In this study, we used the output of QE as the basis for evaluating the performance of the various types of detectors. The closer the QE output, i.e., WV, was to 1, the more accurate the target recognized by DYNet. We took the threshold value WV = 0.8 as the evaluation indicator: when WV ≥ 0.8, we considered the target as TP, and when WV < 0.8, we considered the target as FN.
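The evaluation rule described above can be sketched in a few lines. This is only an illustrative sketch: the function names `label_by_wv` and `precision_recall`, the sample WV values, and the false-positive count are hypothetical and do not come from the paper; only the 0.8 threshold and the TP/FN labelling rule are taken from the text.

```python
WV_THRESHOLD = 0.8  # quality-evaluator threshold stated in the text


def label_by_wv(wv, threshold=WV_THRESHOLD):
    """Label one quality-evaluator output as 'TP' or 'FN', following the text."""
    return "TP" if wv >= threshold else "FN"


def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw counts; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up WV outputs for four detections, for illustration only.
    wv_outputs = [0.95, 0.81, 0.62, 0.80]
    labels = [label_by_wv(wv) for wv in wv_outputs]
    tp = labels.count("TP")
    fn = labels.count("FN")
    fp = 1  # made-up number of non-targets reported as targets
    print(labels, precision_recall(tp, fp, fn))
```

Keeping the thresholding separate from the counting makes it easy to try other WV thresholds without touching the precision and recall arithmetic.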

Experimental Process
The experiments were conducted on a 3070 Ti GPU (Colorful iGame GeForce RTX 3070 Ti Vulcan OC, from Shenzhen, Guangdong, China), and the model was trained with the Darknet framework on a training set of 7000 images.
After tuning, the base learning rate was set to 0.0015 and the input resolution was set to 416 × 416; to match this resolution, compression and stretching operations were performed on the images. Data augmentation operations were used to generate more training samples. The rotation angle was set to 1.5, the saturation adjustment parameter was set to 1.5, the exposure adjustment parameter was set to 1.5, and the tone adjustment parameter was set to 1.5.
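For readers who want to reproduce a comparable setup, the reported values could be collected as follows. This is a hedged sketch rather than the authors' configuration file: the key names are those commonly found in the `[net]` section of Darknet `.cfg` files, mapping the "tone" parameter to `hue` is an assumption, and `TRAINING_PARAMS` and `render_net_section` are hypothetical names.

```python
# Values reported in the text. Key names follow Darknet's [net] section;
# treating the "tone" parameter as `hue` is an assumption, not stated in the paper.
TRAINING_PARAMS = {
    "width": 416,
    "height": 416,
    "learning_rate": 0.0015,
    "angle": 1.5,
    "saturation": 1.5,
    "exposure": 1.5,
    "hue": 1.5,
}


def render_net_section(params):
    """Render the parameters as a Darknet-style [net] section string."""
    lines = ["[net]"]
    lines += [f"{key}={value}" for key, value in params.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_net_section(TRAINING_PARAMS))
```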

Ablation Experiments
To validate the impact of each module in DYNet, ablation experiments were performed on the four small-target samples, as shown in Table 3. Comparative experiments were conducted on networks with and without RPM and then on networks with and without NRAM. As STSM is a necessary detection module, it was used in all ablation experiments.
From Table 3 and Figure 10, it can be seen that DYNet using only STSM obtained its best AP value of 97.31% and AR value of 97.53% on sample 3.
Due to the high noise in samples 2 and 4, NRAM yielded a greater performance improvement on these two samples. However, due to the poor lighting conditions of sample 4, it was difficult for NRAM to produce satisfactory results without RPM.
When RPM was added, DYNet achieved a minimum performance improvement of 1.12% for all samples compared to the basic version. When RPM and NRAM acted together, DYNet achieved a performance of at least 99.87% for all samples.

Table 3. Columns: RPM, NRAM, and AP (%) / AR (%) for each of Sample 1, Sample 2, Sample 3, and Sample 4.

Performance Comparison
DYNet is based on the Darknet framework for the detection of book ladder markers, so in this study we compared it with the more recent models among the related target detection algorithms that use the Darknet framework. The comparison results between DYNet and other target detection methods of the same type for the four small-target samples are shown in Table 4 and Figure 11. The detection results of DYNet for each target are shown in Figure 12. YOLOv8 had been published at the time of this study, but the Darknet framework does not support the YOLOv8 algorithm, so it is not included in the comparison.
As shown in Table 4, DYNet improved AP by 8.7% and AR by 25.2% compared to SSD. Compared to YOLOv3, YOLOv4, and YOLOv5s, AP improved by 3.3%, 2.6%, and 1.8%, and AR improved by 6%, 5.7%, and 3.7%, respectively. Compared with the improved YOLOv5s (YOLOv5s-b in the table) proposed by Yang et al. [9] in 2022, AP improved by 1.4% and AR by 3.2%. Compared to the YOLOv7 proposed by Wang et al. [14] in 2022, AP improved by 0.1% and AR improved by 1%.
As can be seen from the table, with the continuous optimization and iteration of the models, their AP and AR performances improved significantly. The performance of the SSD model was relatively weak. The YOLOv3 and YOLOv4 models reached a high level of performance. The YOLOv5s and improved YOLOv5s models further improved the detection performance, with AP and AR values exceeding 98%. The YOLOv7 model achieved an impressive performance, with AP and AR values reaching 99.8% and 98.9%, respectively. Finally, the AP and AR values of the DYNet model were both 99.9%.

Conclusions
To realize real-time, fast, and accurate detection, this paper proposed the high-precision DYNet detection model for small targets based on the characteristics of the ladder label dataset. It referenced a two-stage target detection algorithm that integrated the functions of extracting object regions and classifying and recognizing regions in one model. DYNet was used to perform comparative tests on four small-target samples, and a quality evaluator was designed. The output of the quality evaluator was used as an evaluation indicator for model performance. The results show that DYNet had better detection accuracy for small targets compared to other models.
The dataset was the main factor affecting the CNN-based target detection algorithm.In this study, we used a self-built dataset as the training samples and achieved high detection accuracy during the testing process.However, DYNet will be confronted with certain challenges in dealing with special data samples that may appear in the future.Thus, it is still necessary to perform a more comprehensive evaluation and optimization of its performance and to further explore its robustness in different environments or scenarios.For target detection of different types, sizes, and angles, the adaptability of this model may be limited.

Discussion
The detection accuracy and recall rate of the DYNet high-precision detection model proposed in this paper meet actual production needs.However, because DYNet is composed of two one-stage object detection algorithms, its detection speed was slightly inferior to that of a single one-stage object detection algorithm.In practical production, DYNet can efficiently detect targets in most cases.However, there may be some delay when processing high-speed image data.This is unacceptable for real-time applications or scenarios that require a quick response.
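Because the two detectors run one after the other, every frame pays for both passes. The chaining can be sketched as below, assuming axis-aligned boxes given as `(x1, y1, x2, y2)` pixel coordinates. The names `crop_region`, `to_image_coords`, and `detect` are hypothetical, and the two detector arguments are stand-in callables for RPM and STSM, not the trained networks.

```python
def crop_region(image, region):
    """Cut a region (x1, y1, x2, y2) out of an image stored as a list of rows."""
    x1, y1, x2, y2 = region
    return [row[x1:x2] for row in image[y1:y2]]


def to_image_coords(box_in_crop, region):
    """Shift a box found inside a crop back into full-image coordinates."""
    offset_x, offset_y = region[0], region[1]
    bx1, by1, bx2, by2 = box_in_crop
    return (bx1 + offset_x, by1 + offset_y, bx2 + offset_x, by2 + offset_y)


def detect(image, region_predictor, target_searcher):
    """Run the region stage, crop each region, then run the search stage on it."""
    results = []
    for region in region_predictor(image):    # stage 1: suspicious regions
        crop = crop_region(image, region)      # cropping step between the stages
        for box in target_searcher(crop):      # stage 2: small targets in the crop
            results.append(to_image_coords(box, region))
    return results


if __name__ == "__main__":
    image = [[0] * 10 for _ in range(8)]       # blank 10 x 8 placeholder image
    fake_rpm = lambda img: [(2, 3, 8, 7)]      # stand-in for the region stage
    fake_stsm = lambda crop: [(1, 1, 3, 2)]    # stand-in for the search stage
    print(detect(image, fake_rpm, fake_stsm))
```

In this arrangement the second stage runs once per predicted region, so the total time per frame grows with the number of suspicious regions found in the first stage.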
To optimize the adaptability and detection speed of the DYNet high-precision detection model, the following measures will be considered to improve DYNet in the future:
1. Introduce more samples and data enhancement techniques: By introducing more types, sizes, and angles of target samples and combining data enhancement techniques, DYNet's adaptability to various target detection situations can be enhanced.
2. Optimize model architecture and algorithms: By adjusting and improving the model architecture and using more advanced target detection algorithms, the accuracy of the algorithms and processing speed can be improved.

Figure 1. Examples of ladder labels that are difficult to detect. (a) The ladder label marked by the red box was bisected by the binding line. (b) The size of the ladder label marked by the red box was one third of the normal ladder label.

Figure 2. (a) Image taken by the camera. (b) Output image after RPM prediction: the red box in the figure is the suspicious target area predicted by RPM. (c) Output image obtained by NRAM cropping after RPM output, which is used as the input image of STSM. (d) Output image of STSM: the red box in the figure shows the final predicted book ladder markers.

Figure 6. STSM network structure.
Two parts of the STSM loss are the same as in RPM: the confidence loss Pr(obj) and the coordinate loss Pr(loc) of the object. The other part is the IoU [22] loss Pr(IoU), where IoU is the Intersection over Union, as shown in the following equation and Figure 7; area A is the intersection of the two target boxes recognized by STSM, and area B is their union.
$$\Pr(\mathrm{IoU}) = 1 - \frac{|\text{area } A|}{|\text{area } B|}$$
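The IoU loss above can be illustrated with a short sketch, assuming axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates. The function names `iou` and `iou_loss` are hypothetical, and this is not the Darknet implementation used in the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h                      # area A
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - intersection)                              # area B
    return intersection / union if union > 0 else 0.0


def iou_loss(box_a, box_b):
    """IoU loss as defined in the text: 1 - |area A| / |area B|."""
    return 1.0 - iou(box_a, box_b)


if __name__ == "__main__":
    print(iou_loss((0, 0, 2, 2), (1, 1, 3, 3)))
```

The loss is 0 for identical boxes and 1 for boxes that do not overlap at all, so minimizing it pushes the two boxes toward full overlap.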

Figure 7. IoU definition: the ratio of intersection area to union area.

Figure 8. The role of quality evaluator in DYNet.

The specific definition is shown in the following equation, where $b_x$, $b_y$, $b_w$, $b_h$ are the coordinates of the real target, and $\hat{b}_x$, $\hat{b}_y$, $\hat{b}_w$, $\hat{b}_h$ are the coordinates of the predicted target output by STSM.

Figure 10. Experimental results data for four samples.

Figure 12. Detection results of the four small-target samples (top to bottom: samples 1-4).
et al. in 2022, AP improved by 1.4% and AR by 3.2%.Compared to the YOLOv7 proposed by Wang [14] et al. in 2022, AP improved by 0.1% and AR improved by 1%.

Table 1. Detailed information of the four small-target samples.

Table 2. Sources of small-target samples.